Movie Rating Model and Predictor

Part 5: Modeling

The aim at this stage was to develop two prediction models. Model One, a simple linear regression model, identified which of the quantitative variables was the best predictor of the log of daily box office revenue, since movie popularity must ultimately translate to revenue. This model would be used in the prediction section for this purpose. The best predictor of the log of daily box office revenue identified by Model One was designated the response variable for Model Two: a multiregression model that selected the best predictors of that response variable. Model Two was the best performing of four multiregression models, developed using both forward selection and backward elimination methods. These four models and their model selection methods were:
Table 1: Multiregression prediction models
Model Model.Selection Data
Alpha Forward Selection Full model
Beta Forward Selection Full model, influential outliers removed
Gamma Backward Elimination Full model
Delta Backward Elimination Full model, influential outliers removed

The remainder of this section is organized as follows.
1. Model One: Simple Linear Regression Model
1.1. Model Selection
1.2. Model Diagnostics
1.3. Model Interpretation
2. Model Two: Multiregression Model
2.1. Model Selection Methods
2.2. Full Model
2.3. Model Alpha
2.4. Model Beta
2.5. Model Gamma
2.6. Model Delta
2.7. Model Comparison
2.8. Final Model
2.9. Model Interpretation

Model One: Simple Linear Regression

Model Selection

Several simple linear models were fit to determine which of the quantitative variables in Table 3 was the best predictor of the log of daily box office revenue.

Table 3: Simple linear regression variables
Variable Description
audience_score Audience score on Rotten Tomatoes
cast_experience The sum across all cast members for a film, of the number of films in which each actor appeared
cast_experience_log Log of the sum across all cast members for a film, of the number of films in which each actor appeared
cast_votes Total number of allocated IMDB votes per day for the cast of a film
cast_votes_log Log of cast_votes
critics_score Critics score on Rotten Tomatoes
director_experience Total number of films in sample for a director
director_experience_log Log of the total number of films directed by the film’s director
imdb_num_votes Number of votes on IMDB
imdb_rating Rating on IMDB
runtime Runtime of movie (in minutes)
runtime_log Log runtime of movie (in minutes)
thtr_days Number of days from theatre release date to January 1, 2016
thtr_days_log Log of thtr_days
votes_per_day The number of IMDB Votes / thtr_days
votes_per_day_log Log of votes_per_day

As suggested by the correlation analysis in Table 4 and summarized in Table 5, the log number of IMDB votes was the best predictor of the log of daily box office revenue (F(1, 212) = 203.17, p < .001), with an adjusted R-Squared of 0.487. The model accounted for 49% of the variance in the response.

Table 5: Best performing simple linear regression on log of box office revenue
Model Size df df Residuals F Statistic RMSE Residual SE R-Squared Adj R-Squared p-value % Variance
imdb_num_votes_log 1 2 212 203.17 2.56 2.56 0.49 0.49 0 48.94
votes_per_day_log 1 2 212 121.97 2.85 2.85 0.37 0.36 0 36.52
votes_per_day_scores_log 1 2 212 110.92 2.90 2.90 0.34 0.34 0 34.35
cast_votes_log 1 2 212 70.69 3.10 3.10 0.25 0.25 0 25.01
imdb_num_votes 1 2 212 60.49 3.15 3.15 0.22 0.22 0 22.20

Model Diagnostics

Linearity

The linearity of the predictor with the log of daily box office is illustrated in Figure 1.

Figure 1 Model One linearity plot

A review of the partial scatterplots indicated that linearity was a reasonable assumption for this model (despite the presence of several influential points). A linear hypothesis test was conducted to test the linearity assumption. The results were significant (F(2), p < .001). As such, the linearity assumption was met in this case.

Homoscedasticity

The following plot (Figure 2) of the residuals versus the fitted values provides a graphic indication of the distribution of residual variances.

Figure 2 Model One homoscedasticity plot

The residuals plot above indicated roughly equal dispersion of residuals about a zero mean. A Breusch–Pagan test was conducted to test the homoscedasticity assumption. The results were significant (F(1), p < .001), which indicates some departure from constant variance, since the null hypothesis of the test is homoscedasticity. Given the even dispersion in the residuals plot, the homoscedasticity assumption was nonetheless treated as reasonable in this case.

Residuals

The histogram and the normal Q-Q plot in Figure 3 illustrate the distribution of residuals.

Figure 3 Model One residuals plot

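The histogram and normal Q-Q plot suggested a nearly normal distribution of residuals. The Shapiro-Wilk test was significant (SW = 0.964, p < .001), which formally rejects exact normality; however, the skewness (-0.553) and kurtosis (3.885) showed only a moderate departure from the values expected under a normal distribution (0 and 3), so approximate normality was considered a reasonable assumption.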
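
The skewness and kurtosis figures quoted above can be illustrated with a short sketch. The report does not state which estimator was used, so the moment-based definitions below are an assumption; kurtosis is reported on the non-excess scale, where a normal distribution has a value of 3. The `residuals` list is hypothetical and is not taken from the analysis.

```python
def skewness_kurtosis(values):
    """Return (skewness, kurtosis) computed from central moments.

    Kurtosis is NOT excess kurtosis: a normal distribution gives about 3.
    """
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # second central moment
    m3 = sum((v - mean) ** 3 for v in values) / n  # third central moment
    m4 = sum((v - mean) ** 4 for v in values) / n  # fourth central moment
    return m3 / m2 ** 1.5, m4 / m2 ** 2


# Hypothetical residuals, used only to exercise the function.
residuals = [-2.0, -1.0, 0.0, 1.0, 2.0]
skew, kurt = skewness_kurtosis(residuals)
print(skew, kurt)
```

A symmetric sample such as this one has zero skewness; values near 0 and 3 are what the text treats as consistent with approximate normality.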

Outliers

Figure 4 Model One Outliers

Examination of the residuals versus leverage plot and case-wise diagnostics such as Cook’s distance revealed 14 cases exerting undue influence on the model. A case-wise review of the influential points did not reveal any data quality issues; therefore, the influential points were not removed from the model.

Model Interpretation

The final prediction equation was defined as follows:
\(y_i\) = -6.86 + 1.22\(x_1\) + \(\epsilon\)

where:
\(x_1\) is imdb_num_votes_log

Analysis of Variance

Figure 5 summarizes the analysis of variance.
Term Df Sum Sq Mean Sq F Statistic Pr(>F) % Var
imdb_num_votes_log 1 1326.734 1326.734 203.17 0 48.94
Residuals 212 1384.395 6.530 NA NA 51.06

Figure 5 Model One analysis of variance

An analysis of variance was conducted on the influence of the single independent variable on the log daily box office. The effect of imdb_num_votes_log on the log daily box office was significant, F(1, 212) = 203.17, p < .001, accounting for 48.94% of the variance. The residuals accounted for the remaining 51.06% of the variance. The model was significant (F(1, 212) = 203.17, p < .001), with an adjusted R-squared of 0.487.
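
As a check on the arithmetic, the F statistic and the percent of variance can be recomputed from the sums of squares and degrees of freedom in the analysis of variance table. The sketch below assumes the usual definitions (mean square = sum of squares / df, F = term mean square / residual mean square); the function name is illustrative.

```python
def anova_summary(ss_term, df_term, ss_resid, df_resid):
    """Return (F statistic, percent of variance for the term)."""
    ms_term = ss_term / df_term      # mean square for the term
    ms_resid = ss_resid / df_resid   # residual mean square
    f_stat = ms_term / ms_resid
    pct_var = 100 * ss_term / (ss_term + ss_resid)
    return f_stat, pct_var


# Values taken from the Model One analysis of variance table.
f_stat, pct_var = anova_summary(1326.734, 1, 1384.395, 212)
print(round(f_stat, 2), round(pct_var, 2))
```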

Interpretation of Coefficients

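The intercept, -6.86, is the predicted log daily box office revenue for a film whose log number of IMDB votes is zero. The predicted log daily box office revenue (in log dollars) is therefore -6.86 plus 1.22 for each unit of the log number of IMDB votes. Because both variables are on the log scale, the slope can also be read as an elasticity: a 1% increase in the number of IMDB votes is associated with an increase of approximately 1.22% in predicted daily box office revenue.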
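
A minimal sketch of how the Model One equation could be applied is given below. The coefficients come from the prediction equation above; the function names are illustrative. The multiplicative reading in `revenue_ratio` follows from both variables being logged in the same base and does not depend on which base was used.

```python
INTERCEPT = -6.86  # from the Model One prediction equation
SLOPE = 1.22       # coefficient on imdb_num_votes_log


def predict_log_revenue(imdb_num_votes_log):
    """Predicted log of daily box office revenue."""
    return INTERCEPT + SLOPE * imdb_num_votes_log


def revenue_ratio(votes_ratio):
    """Factor by which predicted revenue changes when the number of
    IMDB votes is multiplied by votes_ratio."""
    return votes_ratio ** SLOPE


print(predict_log_revenue(10.0))  # a film with log votes of 10
print(revenue_ratio(1.01))        # effect of 1% more votes
```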

Model Two: Multiple Linear Regression

Model Two was the best performing of models Alpha, Beta, Gamma, and Delta. The following provides an overview of the model selection methods used, then each model is described and diagnosed vis-a-vis assumptions of linearity, homoscedasticity, normality of errors, multicollinearity, and the treatment of influential points.

Model Selection Methods

Both forward selection and backward elimination model selection techniques were used. The forward selection approach optimized adjusted R-squared, whereas the backward elimination method was based upon p-values.

Forward Selection

The forward selection process began with a null model. Each candidate variable was then added to the model, one at a time, and the variable that provided the greatest improvement over the current best adjusted R-squared was selected. The process was repeated with the variables not already in the model until all variables had been analyzed. Only the models that improved adjusted R-squared were retained at each step.
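
The procedure described above can be sketched as a greedy loop. This is an illustration under stated assumptions, not the code used for the analysis: the scoring function `adj_r2` is supplied by the caller (in practice it would fit a regression and return its adjusted R-squared), and the variable names and gains in the toy scorer are hypothetical.

```python
def forward_select(candidates, adj_r2):
    """Greedy forward selection.

    adj_r2(variables) must return the adjusted R-squared of the model
    containing exactly those variables (an empty list is the null model).
    """
    selected = []
    best = adj_r2(selected)
    remaining = list(candidates)
    while remaining:
        # Score every model that adds one more variable to the current set.
        scores = [(adj_r2(selected + [v]), v) for v in remaining]
        score, variable = max(scores)
        if score <= best:
            break  # no candidate improves adjusted R-squared
        selected.append(variable)
        remaining.remove(variable)
        best = score
    return selected, best


# Toy scorer with made-up gains; it stands in for a real model fit.
TOY_GAIN = {"a": 0.5, "b": 0.1, "c": -0.02}


def toy_adj_r2(variables):
    return sum(TOY_GAIN[v] for v in variables)


selected, best = forward_select(["a", "b", "c"], toy_adj_r2)
print(selected, best)
```

With the toy scorer, the variable that lowers the score is never added, which mirrors the rule that only models improving adjusted R-squared are retained.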

Backward Elimination

The backward elimination approach began with the full model. A regression analysis was performed and the least significant predictor (that with the highest p-value) was removed from the model. This process was repeated, removing only the least significant predictor at each step, until all predictors had p-values below the preset threshold.
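
The elimination loop can be sketched in the same style. The report does not state the threshold, so the default of 0.05 below is an assumption, and `p_values` is a caller-supplied function that would refit the model and return a p-value for each variable. The toy p-values are hypothetical and, unlike a real refit, do not change as variables are removed.

```python
def backward_eliminate(variables, p_values, threshold=0.05):
    """Backward elimination on p-values.

    p_values(variables) must return a dict mapping each variable in the
    current model to its p-value. threshold=0.05 is an assumed default.
    """
    current = list(variables)
    removed = []
    while current:
        pvals = p_values(current)
        worst = max(current, key=lambda v: pvals[v])
        if pvals[worst] < threshold:
            break  # every remaining predictor is below the threshold
        current.remove(worst)
        removed.append(worst)
    return current, removed


# Toy p-values, fixed for illustration only.
TOY_P = {"x1": 0.001, "x2": 0.30, "x3": 0.06}


def toy_p_values(variables):
    return {v: TOY_P[v] for v in variables}


kept, removed = backward_eliminate(["x1", "x2", "x3"], toy_p_values)
print(kept, removed)
```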

Full Model Selection

Since the objective of the analysis was to determine what factors make a movie popular, the full model did not include variables that could be considered proxies of popularity, such as audience rating or IMDB rating. Such ratings are measures of a film’s popularity, not predictors. Critics rating, on the other hand, was considered not a measure but a potential leading indicator of movie popularity. Similarly, effort was made to capture the popularity of specific cast members to test the hypothesis that a cast’s aggregate popularity could influence the popularity of a film. That said, the criteria for excluding a variable from the full model were as follows:
* Measures of film popularity such as the audience rating, IMDB rating and top 200 box office variables
* Categorical variables with levels including fewer than 5 observations, such as title, url, studio, and the actor variables
* The year and day of theatrical or dvd release

As such, the following full model is presented in Table 6.
Type Variable Description
Categorical best_actor_win Whether or not one of the main actors in the movie ever won an Oscar (no, yes) – note that this is not necessarily whether the actor won an Oscar for their role in the given movie
Categorical best_actress_win Whether or not one of the main actresses in the movie ever won an Oscar (no, yes) – note that this is not necessarily whether the actress won an Oscar for her role in the given movie
Categorical best_dir_win Whether or not the director of the movie ever won an Oscar (no, yes) – note that this is not necessarily whether the director won an Oscar for the given movie
Categorical best_pic_nom Whether or not the movie was nominated for a best picture Oscar (no, yes)
Categorical best_pic_win Whether or not the movie won a best picture Oscar (no, yes)
Categorical genre Genre of movie (Action & Adventure, Comedy, Documentary, Drama, Horror, Mystery & Suspense, Other)
Categorical mpaa_rating MPAA rating of the movie (G, PG, PG-13, R, Unrated)
Categorical thtr_rel_month Month the movie is released in theaters
Numeric cast_experience_log Log of the sum across all cast members for a film, of the number of films in which each actor appeared
Numeric cast_votes_log Log of cast_votes
Numeric critics_score Critics score on Rotten Tomatoes
Numeric director_experience_log Log of the total number of films directed by the film’s director
Numeric runtime_log Log runtime of movie (in minutes)
Numeric thtr_days Number of days from theatre release date to January 1, 2016

The following sections explore various models, model selection techniques, and model diagnostics. Comparisons are conducted and the models are evaluated on test data for prediction accuracy and stability. Lastly, the best performing model is selected and described in detail.

Model Alpha

For this model, a forward selection procedure was undertaken based upon the full model described above. Table 7 lists the variables in the order in which they were added.

Table 7: Model Alpha forward selection process
Step Selected Model.Size DF F.statistic R.Squared Adjusted.R2 p.value Pct Chg
1 cast_votes_log 1 2 489 581.22 0.54 0.54 0 0.00
2 genre 2 12 479 58.83 0.57 0.56 0 4.24
3 critics_score 3 13 478 60.66 0.60 0.59 0 5.13
4 cast_experience_log 4 14 477 59.35 0.62 0.61 0 2.36
5 best_pic_nom 5 15 476 57.96 0.63 0.62 0 1.81
6 director_experience_log 6 16 475 55.99 0.64 0.63 0 1.29
7 thtr_rel_month 7 27 464 33.17 0.65 0.63 0 0.64
8 best_dir_win 8 28 463 32.24 0.65 0.63 0 0.32
9 runtime_log 9 29 462 31.38 0.66 0.64 0 0.32
10 mpaa_rating 10 33 458 27.72 0.66 0.64 0 0.16

As indicated in Table 8 and graphically depicted in Figure 6, the model was significant (F(33, 458) = 27.718, p < .001), with an adjusted R-squared of 0.636.

Table 8: Model Alpha Summary Statistics
Model Size df df Residuals F Statistic RMSE Residual SE R-Squared Adj R-Squared p-value % Variance
Model Alpha 10 33 458 27.718 1.433 1.433 0.659 0.636 0 65.948
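
The adjusted R-squared in Table 8 can be reproduced from the unadjusted value. The sketch below assumes the standard formula and reads the table's df column as the number of estimated coefficients including the intercept, so that there are 32 predictors and 458 + 33 = 491 observations; both readings are assumptions rather than statements from the report.

```python
def adjusted_r_squared(r_squared, n_obs, n_predictors):
    """Standard adjusted R-squared for a model with an intercept."""
    return 1 - (1 - r_squared) * (n_obs - 1) / (n_obs - n_predictors - 1)


# R-squared taken from the % Variance column of Table 8 (65.948%).
adj = adjusted_r_squared(0.65948, n_obs=491, n_predictors=32)
print(round(adj, 3))
```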

Figure 6 Model Alpha Regression

Model Diagnostics

Linearity

The linearity of each predictor with the log number of IMDB votes is illustrated in Figure 7.

Figure 7 Model Alpha linearity plots

A review of the partial scatterplots indicated that linearity was a reasonable assumption for this model (despite the presence of several influential points). A linear hypothesis test was conducted to test the linearity assumption. The results were significant (F(33), p < .001). As such, the linearity assumption was met in this case.

Homoscedasticity

The following plot (Figure 8) of the residuals versus the fitted values provides a graphic indication of the distribution of residual variances.

Figure 8 Model Alpha homoscedasticity plot

The residuals plot above indicated roughly equal dispersion of residuals about a zero mean. A Breusch–Pagan test was conducted to test the homoscedasticity assumption. The results were significant (F(1), p < .05), which indicates some departure from constant variance, since the null hypothesis of the test is homoscedasticity. Given the even dispersion in the residuals plot, the homoscedasticity assumption was nonetheless treated as reasonable in this case.

Residuals

The histogram and the normal Q-Q plot in Figure 9 illustrate the distribution of residuals.

Figure 9 Model Alpha residuals plot

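The histogram and normal Q-Q plot suggested a nearly normal distribution of residuals. The Shapiro-Wilk test was significant (SW = 0.988, p < .001), which formally rejects exact normality; however, the skewness (-0.411) and kurtosis (3.02) were close to the values expected under a normal distribution (0 and 3), so approximate normality was considered a reasonable assumption.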

Multicollinearity

As shown in Figure 10 and Table 9, collinearity did not appear to be present in this model. Variance inflation factors were computed for each predictor in the model. The maximum VIF was reported as -Inf, which is not a valid value (a VIF is never less than 1); no factor was shown to exceed the threshold of 4. As such, the absence of multicollinearity was assumed for this model.

Figure 10: Model Alpha correlations among quantitative predictors

Table 9 Model Alpha Variance Inflation Factors

Outliers

Figure 11 Model Alpha Outliers

Examination of the residuals versus leverage plot and case-wise diagnostics such as Cook’s distance revealed 25 cases exerting undue influence on the model. To discern the effect of these outliers on the model, a new model (Model Beta) was created with the outliers removed.

Model Beta

This was also a forward selection model; however, it was based upon the full model with outliers from Model Alpha removed. The variables were added as described in Table 10.

Table 10: Model Beta forward selection process
Step Selected Model.Size DF F.statistic R.Squared Adjusted.R2 p.value Pct Chg
1 cast_votes_log 1 2 464 699.50 0.60 0.60 0 0.00
2 genre 2 12 454 70.48 0.63 0.62 0 3.67
3 critics_score 3 13 453 72.36 0.66 0.65 0 4.18
4 best_pic_nom 4 14 452 70.06 0.67 0.66 0 1.70
5 director_experience_log 5 15 451 68.38 0.68 0.67 0 1.67
6 cast_experience_log 6 16 450 68.00 0.69 0.68 0 2.09
7 thtr_rel_month 7 27 439 40.21 0.70 0.69 0 0.44
8 runtime_log 8 28 438 39.29 0.71 0.69 0 0.44
9 mpaa_rating 9 32 434 34.52 0.71 0.69 0 0.14
10 best_dir_win 10 33 433 33.70 0.71 0.69 0 0.14
11 best_actor_win 11 34 432 32.77 0.72 0.69 0 0.14

As indicated in Table 11 and graphically depicted in Figure 12, the model was significant (F(34, 432) = 32.773, p < .001), with an adjusted R-squared of 0.693.

Table 11: Model Beta Summary Statistics
Model Size df df Residuals F Statistic RMSE Residual SE R-Squared Adj R-Squared p-value % Variance
Model Beta 11 34 432 32.773 1.286 1.286 0.715 0.693 0 71.457

Figure 12 Model Beta Regression

Model Diagnostics

Linearity

The linearity of each predictor with the log number of IMDB votes is illustrated in Figure 13.

Figure 13 Model Beta linearity plots

A review of the partial scatterplots indicated that linearity was a reasonable assumption for this model (despite the presence of several influential points). A linear hypothesis test was conducted to test the linearity assumption. The results were significant (F(34), p < .001). As such, the linearity assumption was met in this case.

Homoscedasticity

The following plot (Figure 14) of the residuals versus the fitted values provides a graphic indication of the distribution of residual variances.

Figure 14 Model Beta homoscedasticity plot

The residuals plot above indicated roughly equal dispersion of residuals about a zero mean. A Breusch–Pagan test was conducted to test the homoscedasticity assumption. The results were significant (F(1), p < .05), which indicates some departure from constant variance, since the null hypothesis of the test is homoscedasticity. Given the even dispersion in the residuals plot, the homoscedasticity assumption was nonetheless treated as reasonable in this case.

Residuals

The histogram and the normal Q-Q plot in Figure 15 illustrate the distribution of residuals.

Figure 15 Model Beta residuals plot

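The histogram and normal Q-Q plot suggested a nearly normal distribution of residuals. The Shapiro-Wilk test was significant (SW = 0.99, p = 0.003), which formally rejects exact normality; however, the skewness (-0.304) and kurtosis (2.735) were close to the values expected under a normal distribution (0 and 3), so approximate normality was considered a reasonable assumption.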

Multicollinearity

As shown in Figure 16 and Table 12, collinearity did not appear to be present in this model. Variance inflation factors were computed for each predictor in the model. The maximum VIF was reported as -Inf, which is not a valid value (a VIF is never less than 1); no factor was shown to exceed the threshold of 4. As such, the absence of multicollinearity was assumed for this model.

Figure 16: Correlations among quantitative predictors

Table 12 Model Beta Variance Inflation Factors

Outliers

Figure 17 Model Beta Outliers

Examination of the residuals versus leverage plot and case-wise diagnostics such as Cook’s distance revealed 20 cases exerting undue influence on the model. A case-wise review of the influential points did not reveal any data quality issues; therefore, the influential points would not be removed from the model.

Model Gamma

For this model, a backward elimination procedure was undertaken based upon the full model. The variables were removed as described in Table 13.

Table 13: Model Gamma

Step Removed Size Adj.R2 p.value
1 best_actress_win 14 0.6341 0.0994
2 director_experience_log 13 0.6348 0.0527

The model therefore retained the following variables:

Table 14 Model Gamma Variables
Variable Description
genre Genre of movie (Action & Adventure, Comedy, Documentary, Drama, Horror, Mystery & Suspense, Other)
runtime_log Log runtime of movie (in minutes)
mpaa_rating MPAA rating of the movie (G, PG, PG-13, R, Unrated)
thtr_rel_month Month the movie is released in theaters
thtr_days Number of days from theatre release date to January 1, 2016
cast_experience_log Log of the sum across all cast members for a film, of the number of films in which each actor appeared
critics_score Critics score on Rotten Tomatoes
best_pic_nom Whether or not the movie was nominated for a best picture Oscar (no, yes)
best_pic_win Whether or not the movie won a best picture Oscar (no, yes)
best_actor_win Whether or not one of the main actors in the movie ever won an Oscar (no, yes) – note that this is not necessarily whether the actor won an Oscar for their role in the given movie
best_dir_win Whether or not the director of the movie ever won an Oscar (no, yes) – note that this is not necessarily whether the director won an Oscar for the given movie
cast_votes_log Log of cast_votes

As indicated in Table 15 and graphically depicted in Figure 18, the model was significant (F(35, 456) = 25.784, p < .001), with an adjusted R-squared of 0.632.

Table 15 Model Gamma Summary Statistics
Model Size df df Residuals F Statistic RMSE Residual SE R-Squared Adj R-Squared p-value % Variance
Model Gamma 12 35 456 25.784 1.44 1.44 0.658 0.632 0 65.783

Figure 18 Model Gamma Regression

Model Diagnostics

Linearity

The linearity of each predictor with the log number of IMDB votes is illustrated in Figure 19.

Figure 19 Model Gamma linearity plots

A review of the partial scatterplots indicated that linearity was a reasonable assumption for this model (despite the presence of several influential points). A linear hypothesis test was conducted to test the linearity assumption. The results were significant (F(35), p < .001). As such, the linearity assumption was met in this case.

Homoscedasticity

The following plot (Figure 20) of the residuals versus the fitted values provides a graphic indication of the distribution of residual variances.

Figure 20 Model Gamma homoscedasticity plot

The residuals plot above indicated roughly equal dispersion of residuals about a zero mean. A Breusch–Pagan test was conducted to test the homoscedasticity assumption. The results were significant (F(1), p < .05), which indicates some departure from constant variance, since the null hypothesis of the test is homoscedasticity. Given the even dispersion in the residuals plot, the homoscedasticity assumption was nonetheless treated as reasonable in this case.

Residuals

The histogram and the normal Q-Q plot in Figure 21 illustrate the distribution of residuals.

Figure 21 Model Gamma residuals plot

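The histogram and normal Q-Q plot suggested a nearly normal distribution of residuals. The Shapiro-Wilk test was significant (SW = 0.99, p = 0.003), which formally rejects exact normality; however, the skewness (-0.369) and kurtosis (2.999) were close to the values expected under a normal distribution (0 and 3), so approximate normality was considered a reasonable assumption.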

Multicollinearity

As shown in Figure 22 and Table 16, collinearity did not appear to be present in this model. Variance inflation factors were computed for each predictor in the model. The maximum VIF was reported as -Inf, which is not a valid value (a VIF is never less than 1); no factor was shown to exceed the threshold of 4. As such, the absence of multicollinearity was assumed for this model.

Figure 22: Correlations among quantitative predictors

Table 16 Model Gamma Variance Inflation Factors

Outliers

Figure 23 Model Gamma Outliers

Examination of the residuals versus leverage plot and case-wise diagnostics such as Cook’s distance revealed 24 cases exerting undue influence on the model. To discern the effect of the influential points on the model, a new model (Model Delta) was created without the influential points of this model.

Model Delta

This was also a backward elimination model; however, it was based upon the full model with outliers from Model Gamma removed. The variables were removed as described in Table 17.

Table 17: Model Delta

The model therefore retained the following variables:

Table 18 Model Delta Variables
Variable Description
genre Genre of movie (Action & Adventure, Comedy, Documentary, Drama, Horror, Mystery & Suspense, Other)
runtime_log Log runtime of movie (in minutes)
mpaa_rating MPAA rating of the movie (G, PG, PG-13, R, Unrated)
thtr_rel_month Month the movie is released in theaters
thtr_days Number of days from theatre release date to January 1, 2016
director_experience_log Log of the total number of films directed by the film’s director
cast_experience_log Log of the sum across all cast members for a film, of the number of films in which each actor appeared
critics_score Critics score on Rotten Tomatoes
best_pic_nom Whether or not the movie was nominated for a best picture Oscar (no, yes)
best_pic_win Whether or not the movie won a best picture Oscar (no, yes)
best_actor_win Whether or not one of the main actors in the movie ever won an Oscar (no, yes) – note that this is not necessarily whether the actor won an Oscar for their role in the given movie
best_actress_win Whether or not one of the main actresses in the movie ever won an Oscar (no, yes) – note that this is not necessarily whether the actress won an Oscar for her role in the given movie
best_dir_win Whether or not the director of the movie ever won an Oscar (no, yes) – note that this is not necessarily whether the director won an Oscar for the given movie
cast_votes_log Log of cast_votes

As indicated in Table 19 and graphically depicted in Figure 24, the model was significant (F(37, 430) = 29.896, p < .001), with an adjusted R-squared of 0.691.

Table 19 Model Delta Summary Statistics
Model Size df df Residuals F Statistic RMSE Residual SE R-Squared Adj R-Squared p-value % Variance
Model Delta 14 37 430 29.896 1.285 1.285 0.715 0.691 0 71.452

Figure 24 Model Delta Regression

Model Diagnostics

Linearity

The linearity of each predictor with the log number of IMDB votes is illustrated in Figure 25.

Figure 25 Model Delta linearity plots

A review of the partial scatterplots indicated that linearity was a reasonable assumption for this model (despite the presence of several influential points). A linear hypothesis test was conducted to test the linearity assumption. The results were significant (F(37), p < .001). As such, the linearity assumption was met in this case.

Homoscedasticity

The following plot (Figure 26) of the residuals versus the fitted values provides a graphic indication of the distribution of residual variances.

Figure 26 Model Delta homoscedasticity plot

The residuals plot above indicated roughly equal dispersion of residuals about a zero mean. A Breusch–Pagan test was conducted to test the homoscedasticity assumption. The results were significant (F(1), p < .01), which indicates some departure from constant variance, since the null hypothesis of the test is homoscedasticity. Given the even dispersion in the residuals plot, the homoscedasticity assumption was nonetheless treated as reasonable in this case.

Residuals

The histogram and the normal Q-Q plot in Figure 27 illustrate the distribution of residuals.

Figure 27 Model Delta residuals plot

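The histogram and normal Q-Q plot suggested a nearly normal distribution of residuals. The Shapiro-Wilk test was significant (SW = 0.991, p = 0.006), which formally rejects exact normality; however, the skewness (-0.281) and kurtosis (2.71) were close to the values expected under a normal distribution (0 and 3), so approximate normality was considered a reasonable assumption.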

Multicollinearity

As shown in Figure 28 and Table 20, collinearity did not appear to be present in this model. Variance inflation factors were computed for each predictor in the model. The maximum VIF was reported as -Inf, which is not a valid value (a VIF is never less than 1); no factor was shown to exceed the threshold of 4. As such, the absence of multicollinearity was assumed for this model.

Figure 28: Correlations among quantitative predictors

Table 20 Model Delta Variance Inflation Factors

Outliers

Figure 29 Model Delta Outliers

Examination of the residuals versus leverage plot and case-wise diagnostics such as Cook’s distance revealed 17 cases exerting undue influence on the model. A case-wise review of the influential points did not reveal any data quality issues; therefore, the influential points would not be removed from the model.

Model Comparisons

To summarize, models Alpha and Beta were constructed using forward selection and models Gamma and Delta were developed via backward elimination. Models Beta and Delta were fitted without the influential data points from models Alpha and Gamma respectively.

Table 21 Summary of models
Model Size df df Residuals F Statistic RMSE Residual SE R-Squared Adj R-Squared p-value % Variance
Model Alpha 10 33 458 27.718 1.433 1.433 0.659 0.636 0 65.948
Model Beta 11 34 432 32.773 1.286 1.286 0.715 0.693 0 71.457
Model Gamma 12 35 456 25.784 1.440 1.440 0.658 0.632 0 65.783
Model Delta 14 37 430 29.896 1.285 1.285 0.715 0.691 0 71.452

Forward Selection vs. Backward Elimination

As shown in Table 21, the forward selection algorithm produced models with fewer predictors than the backward elimination algorithm. Notwithstanding, the differences in root mean square error between the corresponding models were negligible (0.46% and 0.13%). Similarly, the differences in adjusted R-squared (0.53% and -0.31%) and in the percent of variance explained by the models (-0.25% and -0.01%) were negligible.

Influential Points: Drop or Not

The Beta and Delta models were trained on data sans the influential points from Alpha and Gamma. The differences in RMSE (11.42% and 12.08%) were somewhat significant, as were the differences in adjusted R-squared (8.98% and 9.22%), and the percent of variance explained (8.35% and 8.62%).

Prediction Accuracy

To evaluate the effects of model selection method and the treatment of outliers on prediction accuracy, the four multiregression models were assessed on the test data. Four measures of prediction accuracy were used:

  1. MAPE - Mean Absolute Percentage Error
  2. MPE - Mean Percentage Error
  3. MSE - Mean Squared Error
  4. RMSE - Root Mean Squared Error

In addition, a percent accuracy measure was computed as the percentage of the observations in the test set in which the actual log number of IMDB votes fell within the prediction interval.
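
The measures listed above can be sketched as follows. The report does not give its formulas, so the definitions here are assumptions: the error is taken as actual minus predicted, which makes a negative MPE correspond to over-prediction, and percent accuracy is the share of actual values that fall inside the prediction interval. The sample values are hypothetical.

```python
import math


def accuracy_metrics(actual, predicted):
    """Return MAPE, MPE, MSE and RMSE for paired actual/predicted values."""
    n = len(actual)
    errors = [a - p for a, p in zip(actual, predicted)]
    mape = 100 * sum(abs(e / a) for e, a in zip(errors, actual)) / n
    mpe = 100 * sum(e / a for e, a in zip(errors, actual)) / n
    mse = sum(e ** 2 for e in errors) / n
    return {"MAPE": mape, "MPE": mpe, "MSE": mse, "RMSE": math.sqrt(mse)}


def interval_coverage(actual, lower, upper):
    """Percent of actual values inside [lower, upper] prediction bounds."""
    hits = sum(1 for a, lo, hi in zip(actual, lower, upper) if lo <= a <= hi)
    return 100 * hits / len(actual)


# Hypothetical values, used only to exercise the functions.
metrics = accuracy_metrics([10.0, 20.0], [11.0, 18.0])
coverage = interval_coverage([10.0, 20.0], [9.0, 21.0], [11.0, 22.0])
print(metrics, coverage)
```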

Table 22 Model Predictive Accuracy Summary
Model Size F Statistic R-Squared Adj R-Squared % Variance MAPE MPE MSE RMSE % Accuracy
Model Alpha 10 27.718 0.659 0.636 65.948 8.086 -1.179 2.139 1.463 95.935
Model Beta 11 32.773 0.715 0.693 71.457 8.169 -1.463 2.181 1.477 92.683
Model Gamma 12 25.784 0.658 0.632 65.783 8.130 -1.183 2.167 1.472 95.935
Model Delta 14 29.896 0.715 0.691 71.452 8.063 -1.636 2.136 1.461 93.496

There were no significant differences in MAPE, MSE, and RMSE between the models. The negative MPE indicates that all models were biased toward over-prediction. From a percent accuracy perspective, the forward selection and backward elimination models performed identically when the influential points were retained (95.935% for both Alpha and Gamma), and the models fitted with the influential points were roughly 2.4 to 3.3 percentage points more accurate than their counterparts fitted without them. The Alpha and Gamma models performed equally well; however, the Alpha model did so with 2 fewer variables. Therefore Alpha, the more parsimonious model, was advanced as the final model.

Final Multiregression Model

The final prediction equation was defined as follows:
\(y_i\) = 11.044 + 0.589\(x_1\) - 1.009\(x_2\) - 1.905\(x_3\) - 0.684\(x_4\) - 2.280\(x_5\) - 1.261\(x_6\) - 0.493\(x_7\) - 1.886\(x_8\) - 1.081\(x_9\) - 1.585\(x_{10}\) - 0.174\(x_{11}\) + 0.012\(x_{12}\) - 1.076\(x_{13}\) + 1.090\(x_{14}\) + 0.239\(x_{15}\) - 0.456\(x_{16}\) + 0.296\(x_{17}\) + 0.045\(x_{18}\) - 0.059\(x_{19}\) + 0.380\(x_{20}\) + 0.664\(x_{21}\) + 0.342\(x_{22}\) + 0.022\(x_{23}\) + 0.011\(x_{24}\) + 0.487\(x_{25}\) + 0.293\(x_{26}\) + 0.547\(x_{27}\) + 0.616\(x_{28}\) - 0.292\(x_{29}\) + 0.095\(x_{30}\) - 0.186\(x_{31}\) - 0.640\(x_{32}\) + \(\epsilon\)

where:
\(x_1\) is cast_votes_log
\(x_2\) is genreAnimation coded as 0 or 1 for this genre
\(x_3\) is genreArt House & International coded as 0 or 1 for this genre
\(x_4\) is genreComedy coded as 0 or 1 for this genre
\(x_5\) is genreDocumentary coded as 0 or 1 for this genre
\(x_6\) is genreDrama coded as 0 or 1 for this genre
\(x_7\) is genreHorror coded as 0 or 1 for this genre
\(x_8\) is genreMusical & Performing Arts coded as 0 or 1 for this genre
\(x_9\) is genreMystery & Suspense coded as 0 or 1 for this genre
\(x_{10}\) is genreOther coded as 0 or 1 for this genre
\(x_{11}\) is genreScience Fiction & Fantasy coded as 0 or 1 for this genre
\(x_{12}\) is critics_score
\(x_{13}\) is cast_experience_log
\(x_{14}\) is best_pic_nomyes coded as 0 or 1 for a best picture nomination
\(x_{15}\) is director_experience_log
\(x_{16}\) is thtr_rel_monthFeb coded as 0 or 1 for this month
\(x_{17}\) is thtr_rel_monthMar coded as 0 or 1 for this month
\(x_{18}\) is thtr_rel_monthApr coded as 0 or 1 for this month
\(x_{19}\) is thtr_rel_monthMay coded as 0 or 1 for this month
\(x_{20}\) is thtr_rel_monthJun coded as 0 or 1 for this month
\(x_{21}\) is thtr_rel_monthJul coded as 0 or 1 for this month
\(x_{22}\) is thtr_rel_monthAug coded as 0 or 1 for this month
\(x_{23}\) is thtr_rel_monthSep coded as 0 or 1 for this month
\(x_{24}\) is thtr_rel_monthOct coded as 0 or 1 for this month
\(x_{25}\) is thtr_rel_monthNov coded as 0 or 1 for this month
\(x_{26}\) is thtr_rel_monthDec coded as 0 or 1 for this month
\(x_{27}\) is best_dir_winyes coded as 0 or 1 for a director who has won an Oscar
\(x_{28}\) is runtime_log
\(x_{29}\) is mpaa_ratingPG coded as 0 or 1 for this rating
\(x_{30}\) is mpaa_ratingPG-13 coded as 0 or 1 for this rating
\(x_{31}\) is mpaa_ratingR coded as 0 or 1 for this rating
\(x_{32}\) is mpaa_ratingUnrated coded as 0 or 1 for this rating

Analysis of Variance

Figure 30 summarizes the analysis of variance.
Term Df Sum Sq Mean Sq F Statistic Pr(>F) % Var
cast_votes_log 1 1500.271 1500.271 730.444 0.000 54.31
genre 10 87.228 8.723 4.247 0.000 3.16
critics_score 1 79.978 79.978 38.939 0.000 2.90
cast_experience_log 1 39.636 39.636 19.298 0.000 1.43
best_pic_nom 1 34.029 34.029 16.568 0.000 1.23
director_experience_log 1 23.346 23.346 11.366 0.001 0.85
thtr_rel_month 11 31.592 2.872 1.398 0.170 1.14
best_dir_win 1 7.320 7.320 3.564 0.060 0.26
runtime_log 1 7.161 7.161 3.487 0.063 0.26
mpaa_rating 4 11.242 2.810 1.368 0.244 0.41
Residuals 458 940.693 2.054 NA NA 34.05

Figure 30 Model Alpha analysis of variance

An analysis of variance was conducted on the influence of the 10 independent variables on the log number of IMDB votes. The effect of cast_votes_log was significant, F(1, 458) = 730.444, p < .001, accounting for 54.31% of the variance. The effect of genre was significant, F(10, 458) = 4.247, p < .001, accounting for 3.16% of the variance. The effect of critics_score was significant, F(1, 458) = 38.939, p < .001, accounting for 2.9% of the variance. The effect of cast_experience_log was significant, F(1, 458) = 19.298, p < .001, accounting for 1.43% of the variance. The effect of best_pic_nom was significant, F(1, 458) = 16.568, p < .001, accounting for 1.23% of the variance. The effect of director_experience_log was significant, F(1, 458) = 11.366, p = .001, accounting for 0.85% of the variance. The remaining effects were not significant at the .05 level: thtr_rel_month, F(11, 458) = 1.398, p = .170, accounting for 1.14% of the variance; best_dir_win, F(1, 458) = 3.564, p = .060, accounting for 0.26% of the variance; runtime_log, F(1, 458) = 3.487, p = .063, accounting for 0.26% of the variance; and mpaa_rating, F(4, 458) = 1.368, p = .244, accounting for 0.41% of the variance. The residuals accounted for the remaining 34.05% of the variance. The model was significant (F(33, 458) = 27.718, p < .001), with an adjusted R-squared of 0.636.

Interpretation of Coefficients

Although there are only 10 variables, there are some 33 coefficients, a consequence of the number of levels in the categorical variables. The coefficient estimates are identified in Table 23.

Table 23: Model Alpha Coefficients
term estimate std.error statistic p.value
(Intercept) 11.044 2.218 4.980 0.000
cast_votes_log 0.589 0.030 19.827 0.000
genreAnimation -1.009 0.633 -1.592 0.112
genreArt House & International -1.905 0.521 -3.657 0.000
genreComedy -0.684 0.269 -2.543 0.011
genreDocumentary -2.280 0.389 -5.868 0.000
genreDrama -1.261 0.237 -5.322 0.000
genreHorror -0.493 0.448 -1.100 0.272
genreMusical & Performing Arts -1.886 0.602 -3.132 0.002
genreMystery & Suspense -1.081 0.310 -3.491 0.001
genreOther -1.585 0.476 -3.332 0.001
genreScience Fiction & Fantasy -0.174 0.685 -0.254 0.800
critics_score 0.012 0.003 4.450 0.000
cast_experience_log -1.076 0.201 -5.347 0.000
best_pic_nomyes 1.090 0.366 2.976 0.003
director_experience_log 0.239 0.124 1.924 0.055
thtr_rel_monthFeb -0.456 0.373 -1.222 0.222
thtr_rel_monthMar 0.296 0.312 0.950 0.343
thtr_rel_monthApr 0.045 0.328 0.138 0.890
thtr_rel_monthMay -0.059 0.330 -0.180 0.857
thtr_rel_monthJun 0.380 0.291 1.305 0.192
thtr_rel_monthJul 0.664 0.320 2.077 0.038
thtr_rel_monthAug 0.342 0.337 1.017 0.310
thtr_rel_monthSep 0.022 0.310 0.070 0.945
thtr_rel_monthOct 0.011 0.295 0.039 0.969
thtr_rel_monthNov 0.487 0.325 1.500 0.134
thtr_rel_monthDec 0.293 0.310 0.948 0.344
best_dir_winyes 0.547 0.307 1.784 0.075
runtime_log 0.616 0.338 1.824 0.069
mpaa_ratingPG -0.292 0.451 -0.648 0.518
mpaa_ratingPG-13 0.095 0.466 0.204 0.839
mpaa_ratingR -0.186 0.446 -0.416 0.677
mpaa_ratingUnrated -0.640 0.538 -1.189 0.235

The intercept estimate, 11.044, is the regression estimate for the mean log number of IMDB votes for a G-rated action and adventure film released in January, with no best picture nomination, a director who has not won an Oscar, and zeros for all of the numeric variables. The other coefficient estimates adjust the estimate accordingly. Therefore a prediction for the log number of IMDB votes is equal to the intercept value, 11.044, plus 0.589 log IMDB votes for each unit of cast_votes_log, plus the number of log IMDB votes associated with the genre of the film, plus 0.012 log IMDB votes for each point of the critics score, minus 1.076 log IMDB votes for each unit of the log number of movies in which the cast had previously appeared, plus 1.09 log IMDB votes if the film was nominated for a best picture Oscar, plus 0.239 log IMDB votes for each unit of the log number of movies directed by the film’s director, plus 0.547 log IMDB votes if the director had won an Oscar, plus 0.616 log IMDB votes for each unit of log runtime. Lastly, a number of log IMDB votes are added for the release month, if the film was not released in January, and for the MPAA rating, if the film was not rated G, according to the coefficient estimates in Table 23.
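
To make the interpretation concrete, the sketch below applies the coefficient estimates from Table 23 to a single film. The dictionary keys follow the term names in the table; the function name and the example film's values are hypothetical and are not drawn from the data.

```python
# Coefficient estimates transcribed from Table 23 (Model Alpha).
COEF = {
    "(Intercept)": 11.044, "cast_votes_log": 0.589,
    "genreAnimation": -1.009, "genreArt House & International": -1.905,
    "genreComedy": -0.684, "genreDocumentary": -2.280, "genreDrama": -1.261,
    "genreHorror": -0.493, "genreMusical & Performing Arts": -1.886,
    "genreMystery & Suspense": -1.081, "genreOther": -1.585,
    "genreScience Fiction & Fantasy": -0.174,
    "critics_score": 0.012, "cast_experience_log": -1.076,
    "best_pic_nomyes": 1.090, "director_experience_log": 0.239,
    "thtr_rel_monthFeb": -0.456, "thtr_rel_monthMar": 0.296,
    "thtr_rel_monthApr": 0.045, "thtr_rel_monthMay": -0.059,
    "thtr_rel_monthJun": 0.380, "thtr_rel_monthJul": 0.664,
    "thtr_rel_monthAug": 0.342, "thtr_rel_monthSep": 0.022,
    "thtr_rel_monthOct": 0.011, "thtr_rel_monthNov": 0.487,
    "thtr_rel_monthDec": 0.293, "best_dir_winyes": 0.547,
    "runtime_log": 0.616, "mpaa_ratingPG": -0.292,
    "mpaa_ratingPG-13": 0.095, "mpaa_ratingR": -0.186,
    "mpaa_ratingUnrated": -0.640,
}


def predict_log_votes(numeric, levels):
    """Predicted log number of IMDB votes.

    numeric: dict of numeric predictor values keyed by term name.
    levels: names of the dummy terms that equal 1 for this film;
            baseline levels (Action & Adventure, January, G) are omitted.
    """
    total = COEF["(Intercept)"]
    total += sum(COEF[name] * value for name, value in numeric.items())
    total += sum(COEF[name] for name in levels)
    return total


# Hypothetical film: a PG-13 drama released in July.
film_numeric = {
    "cast_votes_log": 5.0, "critics_score": 70.0,
    "cast_experience_log": 3.0, "director_experience_log": 1.0,
    "runtime_log": 4.7,
}
film_levels = ["genreDrama", "thtr_rel_monthJul", "mpaa_ratingPG-13"]
prediction = predict_log_votes(film_numeric, film_levels)
print(prediction)
```

Passing empty inputs returns the intercept, which corresponds to the baseline film described in the text.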


References

John James jjames@datasciencesalon.org

20 November, 2017